Unit 6: Text Data Processing 


• attacks its own joint tissue
• vaccinated against diphtheria, whooping cough, and tetanus
• developing the disease of Crohn and the assumption of high quantities of
  glucide


The Text Data Processing Entity Extraction transform outputs entities and facts in a flat 
structure for easy consumption by other transforms. Keep in mind that there are inherent 
relationships between output rows. For example, the entity Peter Rezle/PERSON can be 
broken down into Peter/PERSON_GIV and Rezle/PERSON_FAM sub-entities, with each output 
as a separate row. 


The ID and PARENT_ID columns maintain any relationship between the output rows for a piece of text. 
The STANDARD_FORM column represents the longest, most precise, or official name 
associated with the corresponding TYPE column. For example, Peter Rezle might be 
mentioned as Pete Rezle elsewhere in the content; both mentions would then share the 
standard form Peter Rezle. 
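
The parent-child link between flat output rows can be followed with a few lines of code. The sketch below is only an illustration in Python, not part of the transform: the row dictionaries, the ID values, and the helper names group_children and describe are hypothetical, and the sub-entity type names are assumed for the example. It relies on the convention that a PARENT_ID of -1 marks a row with no parent.

```python
from collections import defaultdict

# Hypothetical flat output rows; the ID values are made up for illustration.
rows = [
    {"ID": 1, "PARENT_ID": -1, "TYPE": "PERSON", "SOURCE_FORM": "Peter Rezle"},
    {"ID": 2, "PARENT_ID": 1, "TYPE": "PERSON_GIV", "SOURCE_FORM": "Peter"},
    {"ID": 3, "PARENT_ID": 1, "TYPE": "PERSON_FAM", "SOURCE_FORM": "Rezle"},
]


def group_children(rows):
    """Map each parent ID to the list of rows that point at it."""
    children = defaultdict(list)
    for row in rows:
        if row["PARENT_ID"] != -1:  # -1 marks a row with no parent
            children[row["PARENT_ID"]].append(row)
    return children


def describe(rows):
    """Return one line per top-level row, listing its sub-rows."""
    children = group_children(rows)
    lines = []
    for row in rows:
        if row["PARENT_ID"] == -1:
            subs = ", ".join(
                f'{c["SOURCE_FORM"]}/{c["TYPE"]}' for c in children[row["ID"]]
            )
            lines.append(f'{row["SOURCE_FORM"]}/{row["TYPE"]}: {subs}')
    return lines


print(describe(rows))
```

Grouping by PARENT_ID first keeps the lookup to a single pass over the rows, which matters when one piece of text yields many entities.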


The CONVERTED_TEXT column value represents the possibly transcoded input text. This can 
be used to refer to the location of an extraction using the LENGTH and OFFSET column values 
for an entity or fact. 
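
As a rough illustration of how OFFSET and LENGTH point into CONVERTED_TEXT, the Python sketch below slices a string with those two values. The sample sentence, the row values, and the helper name extract_span are hypothetical. It also assumes that OFFSET is zero-based and counts characters the same way a Python string does; because CONVERTED_TEXT is described as UTF-16, text containing characters outside the Basic Multilingual Plane may need the offsets adjusted.

```python
def extract_span(converted_text, offset, length):
    """Return the substring that an OFFSET/LENGTH pair points at."""
    if offset < 0 or length < 0 or offset + length > len(converted_text):
        raise ValueError("offset/length fall outside the text")
    return converted_text[offset:offset + length]


# Hypothetical input text and output row values, for illustration only.
text = "Peter Rezle joined the company in 2010."
row = {"SOURCE_FORM": "Peter Rezle", "OFFSET": 0, "LENGTH": 11}

span = extract_span(text, row["OFFSET"], row["LENGTH"])
print(span)
```

Comparing the extracted span with SOURCE_FORM is a quick sanity check that the offsets are being interpreted as intended.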


Table 56: Entity Extraction — Output Fields 


Field          | Data type     | Description
ID             | int           | Represents a parent-child relationship between rows.
PARENT_ID      | int           | If present, provides a link to a parent ID value. If not
               |               | present, the default value is set to -1.
STANDARD_FORM  | varchar(2000) | The longest, most precise, or official name associated
               |               | with the value of the corresponding TYPE column.
TYPE           | varchar(255)  | The type of an entity or fact. It can also represent
               |               | subentity types, subtypes, and subfact types.
SOURCE_FORM    | varchar(2000) | The name of an entity, subentity, fact, or subfact as
               |               | mentioned in the input text.
PARAGRAPH_ID   |               | A unique identifier for the paragraph in the
               |               | CONVERTED_TEXT field containing the entity or fact.
SENTENCE_ID    | int           | A unique identifier for the sentence in the
               |               | CONVERTED_TEXT field containing the entity or fact.
CONVERTED_TEXT | long          | The content text representation in UTF-16 encoding of the
               |               | input text.
LENGTH         | int           | The character length of an entity or fact in the
               |               | CONVERTED_TEXT field.
SOURCE         | varchar(10)   | The origin of an entity or fact. One of: SYSTEM,
               |               | DICTIONARY, or RULE.
OFFSET         | int           | The character offset of an entity or fact in the
               |               | CONVERTED_TEXT field.